Extracting review topics

Dataset previews

File name                            File size
-----------------------------------  -----------
yelp_academic_dataset_business.json  0.11 GB
yelp_academic_dataset_checkin.json   0.27 GB
yelp_academic_dataset_review.json    4.98 GB
yelp_academic_dataset_tip.json       0.17 GB
yelp_academic_dataset_user.json      3.13 GB
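
A listing like the one above could be produced with a short helper. This is a minimal sketch, not the notebook's own code: the function name `list_dataset_sizes`, the glob pattern, and the choice of 1024**3 bytes per GB are assumptions.

```python
from pathlib import Path


def list_dataset_sizes(data_dir, pattern="yelp_academic_dataset_*.json"):
    """Return (file name, size in GB) pairs for the matching files in data_dir."""
    rows = []
    for path in sorted(Path(data_dir).glob(pattern)):
        # Assumption: 1 GB is counted as 1024**3 bytes.
        size_gb = path.stat().st_size / 1024**3
        rows.append((path.name, round(size_gb, 2)))
    return rows
```

Calling `list_dataset_sizes("data")` on a folder holding the five Yelp files (the folder name is hypothetical) would return one pair per file, sorted by name.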

Preview of the "business" dataset

business_id name address city state postal_code latitude longitude stars review_count is_open attributes categories hours
46 JX4tUpd09YFchLBuI43lGw Naked Cyber Cafe & Espresso Bar 10303 108 Street NW Edmonton AB T5J 1L7 53.544682 -113.506589 4.0 12 1 {'OutdoorSeating': 'False', 'BusinessParking':... Arts & Entertainment, Music Venues, Internet S... {'Monday': '11:0-1:0', 'Tuesday': '11:0-1:0', ...
214 LVYAXWQB3t7tdwWteyjfhw Option 1 Barber Shop 5537 Sheldon Rd, Ste E Tampa FL 33615 27.998700 -82.582253 4.0 16 0 {'ByAppointmentOnly': 'False', 'BusinessParkin... Barbers, Beauty & Spas {'Monday': '9:0-19:0', 'Tuesday': '9:0-19:0', ...
736 S2LinHvVEXAm2jv84_kXLw St. Louis Artists' Guild 2 Oak Knoll Park Saint Louis MO 63105 38.637678 -90.319549 4.0 9 0 {'RestaurantsPriceRange2': '2', 'BusinessAccep... Festivals, Art Galleries, Local Flavor, Commun... {'Tuesday': '12:0-16:0', 'Wednesday': '12:0-16...
992 7zfO3VB6wEqDnk0_U16uBg Maximum Grow Gardening 6117 E Washington St Indianapolis IN 46219 39.771263 -86.061167 4.0 5 1 {'BikeParking': 'True', 'BusinessAcceptsCredit... Home Services, Home & Garden, Shopping, Garden... {'Monday': '10:30-19:0', 'Tuesday': '10:30-19:...
974 zKBIdA2j49REmU2bFR1mdw BayLife Physical Therapy and Rehabilitation- S... 8950 Doctor Martin Luther King Junior St N, St... St. Petersburg FL 33702 27.854197 -82.647427 5.0 9 1 {'ByAppointmentOnly': 'True', 'BusinessAccepts... Rehabilitation Center, Health & Medical, Physi... {'Monday': '8:0-19:0', 'Tuesday': '8:0-19:0', ...
business_id                                SFKjUQ1gmfwm7cJhMCFmkA
name                                               ZigZag Scallop
address                                          4417 Calienta St
city                                               Hernando Beach
state                                                          FL
postal_code                                                 34607
latitude                                                  28.4977
longitude                                              -82.650015
stars                                                         4.0
review_count                                                  214
is_open                                                         1
attributes      {'Alcohol': ''full_bar'', 'GoodForMeal': '{'de...
categories      Nightlife, Seafood, Bars, Restaurants, America...
hours           {'Monday': '11:30-21:0', 'Thursday': '11:30-21...
Name: 595, dtype: object

Preview of the "checkin" dataset

     business_id             date
---  ----------------------  -------------------------------------------------
398  -Awb67JgBbySP4mQtOtNsA  2011-09-24 15:48:25, 2012-04-19 17:24:46, 2012...
292  -7aDp7JsogemWTKOuZdNTw  2011-01-28 18:01:12, 2011-03-18 04:17:53, 2011...
557  -FeeEqYmAJ00G66ngdxFZg  2014-08-24 15:52:47, 2014-09-14 17:59:58
802  -OKB11ypR4C8wWlonBFIGw  2010-03-21 01:26:00, 2010-03-27 05:43:30, 2010...
261  -6qt8a52bBwMogqwZsooOA  2021-06-11 17:36:41, 2021-07-03 16:23:55, 2021...

business_id                               -1K1J_D9eT2dR6BNvQ2Tnw
date           2010-10-19 23:45:27, 2011-12-26 20:17:49, 2012...
Name: 82, dtype: object

Preview of the "review" dataset

     review_id               user_id                 business_id             stars  useful  funny  cool  text                                               date
---  ----------------------  ----------------------  ----------------------  -----  ------  -----  ----  -------------------------------------------------  -------------------
787  zBrO_zs81k9U-ZpyR_p8fw  FsbQY_iNJPm4xAZ9vERCBw  Jx2AoB_IQOUrZ3s6fdAUSA  5      0       0      0     Great Cranberry Orange Muffin and Caffe Latte ...  2017-05-13 13:05:01
263  wEfzqOfbwn4Ohe2ZDOLAzw  VMtyZjaEJB9nfmjr4xdVlw  GBTPC53ZrG1ZBY3DT8Mbcw  4      1       1      0     First meal in New Orleans. I had the $15 lunch...  2012-11-06 22:28:18
415  ijG3hyvzneIplamARXzfEg  TDGx8YhxmF3OP0XH3dYsXw  SoJwDKedR7SJh7-G69C38A  5      0       0      0     no contest best Chinese good in the area. owne...  2016-11-09 23:41:55
176  x-1wBBwja9l2Hr5bgqsG0A  L-2Qdi16eMRbATGDP6ADHg  mQvRi0nm84Www71d4qOheQ  5      1       0      1     Excellent food. We tried three appetizers and ...  2016-07-03 21:13:10
674  YFp9hHkElfJGvvdo5T9MuA  98jv8gu7kAwa2WzIPdw6-w  _RwlMTw9uFeOkfX9Ctf1HA  3      0       0      0     While I generally support independent restaura...  2015-01-04 00:57:18

review_id                                 vqmhsvXK9z4TTvnVDNpPDQ
user_id                                   ziNigH8BY9gRDvrmSsJTOw
business_id                               ICqgjbOpBD9SUtE5PQC9sA
stars                                                          5
useful                                                         0
funny                                                          0
cool                                                           0
text           Fun, no-frills atmosphere, right on the water,...
date                                         2018-02-23 23:06:03
Name: 235, dtype: object

Preview of the "tip" dataset

     user_id                 business_id             text                                               date                 compliment_count
---  ----------------------  ----------------------  -------------------------------------------------  -------------------  ----------------
300  Um5bfs5DH6eizgjH3xZsvg  Dv1SSVUWj1qmvAaSuRiCdg  Best brunch in the Clearwater area. Come and c...  2014-03-14 16:57:19  0
945  X1nvKXUJ5Lp3W9Oe-_JrMQ  fo4TOAiwYEZ5p13kqg1Ukw  Use the phone app and preload for ease of paym...  2013-12-29 18:16:29  0
666  cLCMvwsFgKx2f6rrK3boOQ  434A83c2ig6QxsZjrjclpQ  Spinach dip is amazing! Come at least 30 minut...  2013-08-23 14:10:54  0
615  0xZucjnNt2beD1veIAWLwA  h7Fq7pBe2uMD5doA91j6XQ  Yazoo Pale Ale on draft & live music!              2011-09-30 22:09:32  0
258  njcXFGqIuSp-_joP42MhxA  ltBBYdNzkeKdCNPDAsxwAA  The kids are eating...it's a miracle...            2011-08-20 19:29:22  0

user_id                                        weB8wGdi1A1SXh8CMCXDTw
business_id                                    VZWuhqiPJCZmHfJmwdiCGA
text                Thursday - 1/2 price wine and fish & chip special
date                                              2018-03-30 02:36:09
compliment_count                                                    0
Name: 689, dtype: object

Preview of the "user" dataset

user_id name review_count yelping_since useful funny cool elite friends fans ... compliment_more compliment_profile compliment_cute compliment_list compliment_note compliment_plain compliment_cool compliment_funny compliment_writer compliment_photos
73 KxrKVxdXGkfMJ9XwJZzoLQ Lisa 950 2008-10-16 21:03:02 1243 368 527 2010,2011,2012,2013,2014,2015,2016,2017,2018,2... EWNe5k2pLEefqWnAdC4_1A, C44UVPGmzusO4_a576Wa6g... 35 ... 2 1 1 1 26 51 34 34 11 5
436 fHS0bQ-l5rHME_xXKQSYXQ Kevin 1401 2007-03-19 18:19:11 7875 3954 6616 2007,2008,2009,2010,2011,2012 4Zi2HXp_uEjAgJHTvIsCXg, BWsutShwFQiQMoITF9IMOg... 383 ... 49 31 46 75 515 1589 947 947 264 60
633 xQXUG94oRQxYUHZwS6Cwzg Michelle 943 2009-05-31 05:08:55 4092 2315 2561 2009,2010,2011,2012,2013,2014,2015,2016,2017 NdHGV2JmZmhYG1tSCRwrBg, wr1Y8-yLVCoaDKR4ihiTGg... 69 ... 64 13 15 6 357 537 561 561 180 34
700 0iq45WV_h_j8SXjR4ytcJg A 186 2009-12-08 23:21:45 669 301 379 dXpgRqVIJZZcD87Tf9DtUQ, 5pDJuri4g3Wfes8GGlnudA... 46 ... 2 1 0 0 7 22 18 18 4 4
192 NRrNQ5xHn_7Fu4ctlpKLbQ Justin 1002 2006-12-31 06:40:43 1256 639 549 2015,2016,2017,2018,2019,20,20,2021 tVzBb8_2bkknwBZIbSq3hQ, vQ4IV9xP_t-lfj0eGE46hA... 51 ... 6 2 2 1 26 50 48 48 16 3

5 rows × 22 columns

user_id                                          z3yOJaNdvqzXc6L1RbTl_w
name                                                            rebecca
review_count                                                        107
yelping_since                                       2006-12-04 19:03:52
useful                                                              152
funny                                                                42
cool                                                                 43
elite                                                              2010
friends               hZg-KQusgDFRrcGTRUuSWA, NFYDEgblCeBMlwdpL_UsRA...
fans                                                                  1
average_stars                                                      3.06
compliment_hot                                                        1
compliment_more                                                       2
compliment_profile                                                    0
compliment_cute                                                       1
compliment_list                                                       0
compliment_note                                                       6
compliment_plain                                                      3
compliment_cool                                                       1
compliment_funny                                                      1
compliment_writer                                                     0
compliment_photos                                                     0
Name: 644, dtype: object
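
Previews of this kind (a few random rows, then one record shown as a Series) can be obtained without loading the multi-gigabyte files in full. The sketch below is an assumption about how that might be done, not the notebook's actual code: `preview_dataset`, `nrows`, `n_samples` and `seed` are hypothetical names, and reading only the first 1000 lines is a guess based on the row indices shown above.

```python
import pandas as pd


def preview_dataset(path, nrows=1000, n_samples=5, seed=None):
    """Read the first `nrows` records of a JSON Lines file and sample a few rows."""
    # lines=True is required because each line of the file is one JSON object;
    # nrows limits how much of the file is parsed.
    head = pd.read_json(path, lines=True, nrows=nrows)
    return head.sample(n=min(n_samples, len(head)), random_state=seed)
```

A single record in the vertical layout shown above would then be something like `preview_dataset(path, n_samples=1).iloc[0]`.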

23% of reviews have at most 2 stars
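
A share like this can be computed without holding the whole review file in memory. The following is a sketch under assumptions: the helper names `star_counts` and `share_at_most` are hypothetical, and the file is assumed to be JSON Lines with a numeric `stars` field, as in the review preview.

```python
from collections import Counter

import pandas as pd


def star_counts(path, chunksize=100_000):
    """Count reviews per star rating, streaming the file chunk by chunk."""
    counts = Counter()
    with pd.read_json(path, lines=True, chunksize=chunksize) as reader:
        for chunk in reader:
            for stars, n in chunk["stars"].value_counts().items():
                counts[int(stars)] += int(n)
    return dict(counts)


def share_at_most(counts, max_stars=2):
    """Fraction of reviews whose rating is at most `max_stars`."""
    total = sum(counts.values())
    low = sum(n for stars, n in counts.items() if stars <= max_stars)
    return low / total
```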

Overview of categories

  1. Doctors, Traditional Chinese Medicine, Naturopathic/Holistic, Acupuncture, Health & Medical, Nutritionists
  2. Shipping Centers, Local Services, Notaries, Mailbox Centers, Printing Services
  3. Department Stores, Shopping, Fashion, Home & Garden, Electronics, Furniture Stores
  4. Restaurants, Food, Bubble Tea, Coffee & Tea, Bakeries
  5. Brewpubs, Breweries, Food
  6. Burgers, Fast Food, Sandwiches, Food, Ice Cream & Frozen Yogurt, Restaurants
  7. Sporting Goods, Fashion, Shoe Stores, Shopping, Sports Wear, Accessories
  8. Synagogues, Religious Organizations
  9. Pubs, Restaurants, Italian, Bars, American (Traditional), Nightlife, Greek
  10. Ice Cream & Frozen Yogurt, Fast Food, Burgers, Restaurants, Food
  11. Department Stores, Shopping, Fashion
  12. Vietnamese, Food, Restaurants, Food Trucks
  13. American (Traditional), Restaurants, Diners, Breakfast & Brunch
  14. General Dentistry, Dentists, Health & Medical, Cosmetic Dentists
  15. Food, Delis, Italian, Bakeries, Restaurants
  16. Sushi Bars, Restaurants, Japanese
  17. Automotive, Auto Parts & Supplies, Auto Customization
  18. Vape Shops, Tobacco Shops, Personal Shopping, Vitamins & Supplements, Shopping
  19. Automotive, Car Rental, Hotels & Travel, Truck Rental
  20. Korean, Restaurants
  21. etc...

Sample reviews by rating

Rating = 1

They have the WORST service advisors! Used to be good before Kelly and her team left. Unfortunately, it's convenient to work if I need oil change before I can make it to another Honda dealer.
It is unfortunate that with such a unique location and such a brand and product offering this specific store offers such lousy service. The wait is endless, no one is available to help and at Christmas time getting a gift wrap is act of God that requires endless wait. I bought gifts and knew that the wait for wrapping would be long SO I even left my items at the store to be gift wrapped at their leisure. They were not even moved from the counter where I bought them when I returned almost two hours later ready for pick up. This was a gift that needed to be given and The staff COMPLETELY "dropped the ball" on my time constraints! I love their stuff, but today was my last shopping experience at this location: couldn't get a gift wrapped after being assured that it could be done in a timely fashion??? I'll cancel my card, do everything online and try not to go there if I can. It's really a shame!

Rating = 2

We arrived for lunch at 12p on a Friday - wasn't busy at all. It took FOREVER to get anything. We were really pleasant and kind but we had to complain several times to get any beverages that we ordered. It took them 40 minutes to let us know that a beer we ordered was no longer available. The food was great, but not worth the lengthy wait and slow service.
This use to be a reliable place for sandwiches but the last two were not good at all. Cheesesteak was light on the meat. There are so many good rolls to pick from in the Philly area but this roll was one of the worst and stale. Hopefully it was just an off night. I'll try again but not for awhile.

Rating = 3

Honestly the food doesn't knock my socks off but other people seem to love this place. I go because my husband likes it as for me I'd rather go to a different BBQ spot. I guess it also depends on what you order.
If not for the pretentious, haughty, superior attitude of our waiter, I would have given this place four stars or possibly more... Seriously. That kind of attitude is exactly why I left New York. We wanted to order a bottle of wine, asked for his suggestion, and he answers with, "How much do you want to spend?" Ummm... Excuse me? How about, "I'm happy to make some recommendations. Let's find a price point you're comfortable with..." He didn't smile ONCE during the meal service and also found it necessary to correct us on several points of preference. Snob. The food is GOOD. The only thing that was great was the blueberry lasagna. And it is superb. So was the chocolate confection for dessert. I'd say go. And I hope you don't get that waiter.

Rating = 4

I love the concept of this place. One half of it a café that serves different types of coffees and teas, and breakfast type items, and the other world where it serves bar type apps, salads, and sandwiches, Can't forget that they also serve a wide variety of beers that they serve on tap, bottled, or in can. This place is pretty darn casual, and one can hang out here for hours with their friends. Reminds me of the good ol' college days when no time could pass especially on a chill out Sunday Fun day. Bring your four legged friend too. They are totally welcome.
We had an alumni event here and I really enjoyed it. It's dimly lit, very cozy inside - it's decorated kind of like an old library. It was a great quiet,casual spot if you're looking for a low key place to have good drinks. They have a few beers and wines, but we focused mostly on the happy hour cocktails - moscow mules, old fashioneds, etc. Everyone loved theirs, no complaints!

Rating = 5

We sat at a pretty hectic lunch at Johnny rockets in the casino. Our server was Lyndel! She was awesome! Helped us at every needy request lol... Good was good, too! I'm too full
Great store. Insane selection. Incredible customer service. Wish they could come to Ft. McMurray. :(

Extracting a sample of reviews

52268 businesses are restaurants

Extraction characteristics

  • In chunks of 100000
  • Filtering reviews on the "restaurants" category
  • Split by rating
  • Adding business information
  • Quantity limit: 1000 reviews per rating
  • Storage in Parquet files
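
The steps listed above can be sketched as follows. This is an illustration under assumptions rather than the notebook's implementation: the function name `sample_restaurant_reviews`, taking the first reviews encountered for each rating, and the `_business` suffix for clashing column names are all choices made here.

```python
import pandas as pd


def sample_restaurant_reviews(review_path, business, per_rating=1000,
                              chunksize=100_000, category="restaurants"):
    """Collect up to `per_rating` restaurant reviews for each star rating."""
    # Businesses whose category string mentions the requested category.
    categories = business["categories"].fillna("").str.lower()
    restaurants = business[categories.str.contains(category, regex=False)]
    restaurant_ids = set(restaurants["business_id"])

    parts = {stars: [] for stars in range(1, 6)}
    counts = {stars: 0 for stars in range(1, 6)}

    # Stream the large JSON Lines file instead of loading it whole.
    with pd.read_json(review_path, lines=True, chunksize=chunksize) as reader:
        for chunk in reader:
            chunk = chunk[chunk["business_id"].isin(restaurant_ids)]
            for stars, group in chunk.groupby("stars"):
                stars = int(stars)
                if stars not in counts:
                    continue
                missing = per_rating - counts[stars]
                if missing > 0:
                    kept = group.head(missing)
                    parts[stars].append(kept)
                    counts[stars] += len(kept)
            if all(n >= per_rating for n in counts.values()):
                break  # every rating has reached its quota

    # Attach the business information to each per-rating sample.
    return {
        stars: pd.concat(frames, ignore_index=True).merge(
            restaurants, on="business_id", how="left",
            suffixes=("", "_business"))
        for stars, frames in parts.items() if frames
    }
```

Each returned frame could then be written with `DataFrame.to_parquet`, which needs a Parquet engine such as pyarrow to be installed; nested columns like `attributes` and `hours` may have to be converted to strings first.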

Word clouds by rating

Rating = 5

[Figure: word cloud for 5-star reviews]

Rating = 4

[Figure: word cloud for 4-star reviews]

Rating = 3

[Figure: word cloud for 3-star reviews]

Rating = 2

[Figure: word cloud for 2-star reviews]

Rating = 1

[Figure: word cloud for 1-star reviews]

Finding topics of dissatisfaction

Bag of words (TF-IDF)

Vectorization of reviews with at most 2 stars
Number of duplicate texts removed: 0
There are 2000 records

TF-IDF vectors of the reviews:
====================
100 1st 2nd able absolute absolutely accept accommodate acknowledge across ... yelp yes yesterday yet york young yuck yummy zero zero star
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.231769 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.176228 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0

5 rows × 1901 columns

+------------------------------------------------------------------------------------------------------+--------------------------+
| review text                                                                                          | TF-IDF vector            |
+======================================================================================================+==========================+
| You don't accept cash?  I don't think you grasp the ramifications of such a corpo-fascist economic   | accept: 0.2965           |
| principle.  No room for arrogant commies in my diet thank you so very very little.  Can't wait to    | anyone: 0.2395           |
| see this place nosedive  For anyone in the dark about this policy,  watch Mike Judge's film,         | attention: 0.2499        |
| Idiocracy. Pay close attention throughout the hospital scene.   "Unscannable!!!"                     | cash: 0.2758             |
|                                                                                                      | close: 0.1923            |
|                                                                                                      | dark: 0.2780             |
|                                                                                                      | diet: 0.3070             |
|                                                                                                      | little: 0.1816           |
|                                                                                                      | pay: 0.1711              |
|                                                                                                      | room: 0.2075             |
|                                                                                                      | see: 0.1698              |
|                                                                                                      | thank: 0.2486            |
|                                                                                                      | throughout: 0.2851       |
|                                                                                                      | very little: 0.2826      |
|                                                                                                      | very very: 0.3111        |
|                                                                                                      | watch: 0.2318            |
+------------------------------------------------------------------------------------------------------+--------------------------+
| Visiting from out of town I was excited to visit Chef Milly's newly opened restaurant.  It was a     | chef: 0.3954             |
| sore DISAPPOINTMENT. Waited even w/a reservation and tried to be patient since they had just         | definitely: 0.1777       |
| recently opened.  Chef Milly does not even offer a smile to his visitors.  The food was just OK and  | definitely not: 0.2266   |
| definitely not worth the wait.  I will commend my server (I think his name was Ty) on his            | disappointment: 0.1942   |
| professionalism even amongst the multiple jobs he was tasked.  Chef Ramsey would be highly           | excite: 0.1864           |
| disappointed - I surely was.                                                                         | highly: 0.2429           |
|                                                                                                      | job: 0.1992              |
|                                                                                                      | multiple: 0.2079         |
|                                                                                                      | name: 0.1809             |
|                                                                                                      | not even: 0.1912         |
|                                                                                                      | not worth: 0.1877        |
|                                                                                                      | offer: 0.1547            |
|                                                                                                      | open: 0.2541             |
|                                                                                                      | recently: 0.2150         |
|                                                                                                      | reservation: 0.1983      |
|                                                                                                      | since: 0.1442            |
|                                                                                                      | smile: 0.2250            |
|                                                                                                      | town: 0.1905             |
|                                                                                                      | visit: 0.2657            |
|                                                                                                      | worth: 0.1579            |
|                                                                                                      | worth wait: 0.2542       |
+------------------------------------------------------------------------------------------------------+--------------------------+
| Cashier was very rude fat Hispanic girl at the airport location. Very very rude. Half the menu       | airport: 0.3057          |
| wasn't even available.                                                                               | available: 0.2564        |
|                                                                                                      | cashier: 0.2867          |
|                                                                                                      | fat: 0.2803              |
|                                                                                                      | girl: 0.2364             |
|                                                                                                      | half: 0.1883             |
|                                                                                                      | location: 0.1860         |
|                                                                                                      | rude: 0.3302             |
|                                                                                                      | very rude: 0.5024        |
|                                                                                                      | very very: 0.3208        |
|                                                                                                      | wasn even: 0.2996        |
+------------------------------------------------------------------------------------------------------+--------------------------+
| The worst customer service and nasty employees EVER!!!  SLOW SERVICE!!!Missing items and cashier not | call: 0.1619             |
| giving correct change.  I felt like she wanted my $10.  Manager was no better...nasty and said to    | care: 0.1853             |
| call and complain...she did NOT care. NEVER again!!!!                                                | cashier: 0.2566          |
|                                                                                                      | change: 0.1931           |
|                                                                                                      | complain: 0.2101         |
|                                                                                                      | correct: 0.2343          |
|                                                                                                      | customer service: 0.1953 |
|                                                                                                      | employee: 0.1886         |
|                                                                                                      | ever: 0.1548             |
|                                                                                                      | felt: 0.1931             |
|                                                                                                      | felt like: 0.2426        |
|                                                                                                      | item: 0.1942             |
|                                                                                                      | manager: 0.1600          |
|                                                                                                      | miss: 0.2079             |
|                                                                                                      | nasty: 0.4218            |
|                                                                                                      | not care: 0.2872         |
|                                                                                                      | not give: 0.2767         |
|                                                                                                      | slow: 0.1819             |
|                                                                                                      | slow service: 0.2587     |
+------------------------------------------------------------------------------------------------------+--------------------------+
| Food was just about ok. Service very average and slow. Also, charging extra for something as simple  | average: 0.2527          |
| as soy sauce is a bit strange when their other condiments were complimentary. Bar staff was good     | bar: 0.2019              |
| though.                                                                                              | bite: 0.2135             |
|                                                                                                      | charge: 0.2370           |
|                                                                                                      | extra: 0.2534            |
|                                                                                                      | food service: 0.2887     |
|                                                                                                      | good though: 0.3770      |
|                                                                                                      | sauce: 0.2010            |
|                                                                                                      | service very: 0.3321     |
|                                                                                                      | simple: 0.3032           |
|                                                                                                      | slow: 0.2354             |
|                                                                                                      | something: 0.2109        |
|                                                                                                      | strange: 0.3437          |
|                                                                                                      | though: 0.2078           |
+------------------------------------------------------------------------------------------------------+--------------------------+

LDA (scikit-learn library)

Topic search with the following parameters:

param        value
-----------  --------
max_stars    2
min_df       2
max_df       0.1
n_topics     3
alpha        0.5
n_top_words  5
ngram_range  (1, 1)

  - Vectorization (TF-IDF)
  - LDA modeling
  - Topic display

   Topic #  Top words
----------  ---------------------------------------
         0  pizza, waitress, burger, sauce, bar
         1  goopy, mahi, vista, isla, cashew
         2  donut, fancy, environment, refer, bueno
Topic search with the following parameters:

param        value
-----------  --------
max_stars    2
min_df       2
max_df       0.2
n_topics     3
alpha        0.8
n_top_words  5
ngram_range  (3, 3)

  - Vectorization (TF-IDF)
  - LDA modeling
  - Topic display

   Topic #  Top words
----------  ----------------------------------------------------------------------------------------------
         0  would not recommend, waste time money, wait another minute, not recommend place, give one star
         1  not very good, want like place, food good service, give two star, nothing write home
         2  take minute get, not worth wait, service very slow, food nothing special, not worth price

LDA (Gensim library)

Topic search with the following parameters:

param      value
---------  --------
max_stars  2
no_below   2
no_above   0.2
n_topics   3
n_grams    [2, 3]

  - Data preparation (preprocessing, tokenization...)
  - LDA for 3 topics
+------------+--------------------------+
|    Topic # | keywords                 |
+============+==========================+
|          1 | 0.004*"taste like"       |
|            | 0.004*"come back"        |
|            | 0.003*"win back"         |
|            | 0.003*"food not"         |
|            | 0.002*"not good"         |
|            | 0.002*"mac cheese"       |
|            | 0.002*"wait staff"       |
|            | 0.001*"felt like"        |
|            | 0.001*"very good"        |
|            | 0.001*"place not"        |
+------------+--------------------------+
|          2 | 0.002*"not worth"        |
|            | 0.002*"look like"        |
|            | 0.002*"next time"        |
|            | 0.002*"much good"        |
|            | 0.002*"very disappoint"  |
|            | 0.002*"not sure"         |
|            | 0.002*"first time"       |
|            | 0.002*"come back"        |
|            | 0.002*"mash potato"      |
|            | 0.002*"not return"       |
+------------+--------------------------+
|          3 | 0.003*"get food"         |
|            | 0.003*"food good"        |
|            | 0.003*"take order"       |
|            | 0.002*"wait minute"      |
|            | 0.002*"would not"        |
|            | 0.002*"take minute"      |
|            | 0.002*"customer service" |
|            | 0.002*"last night"       |
|            | 0.002*"place order"      |
|            | 0.002*"drink order"      |
+------------+--------------------------+
  - Data preparation (preprocessing, tokenization...)
  - LDA for 3 topics
+------------+-----------------------------+
|    Topic # | keywords                    |
+============+=============================+
|          1 | 0.009*"never come back"     |
|            | 0.008*"waste time money"    |
|            | 0.005*"could give zero"     |
|            | 0.004*"would not recommend" |
|            | 0.004*"get money back"      |
|            | 0.004*"take drink order"    |
|            | 0.004*"buy one get"         |
|            | 0.004*"not come back"       |
|            | 0.003*"take minute get"     |
|            | 0.003*"get order right"     |
+------------+-----------------------------+
|          2 | 0.005*"want like place"     |
|            | 0.005*"take minute get"     |
|            | 0.004*"speak manager tell"  |
|            | 0.004*"say didn know"       |
|            | 0.004*"never come back"     |
|            | 0.004*"make eye contact"    |
|            | 0.004*"would not recommend" |
|            | 0.004*"could give zero"     |
|            | 0.004*"waste time money"    |
|            | 0.004*"wish could give"     |
+------------+-----------------------------+
|          3 | 0.005*"give zero star"      |
|            | 0.005*"come take order"     |
|            | 0.005*"never come back"     |
|            | 0.004*"order wait minute"   |
|            | 0.004*"really want like"    |
|            | 0.003*"new york pizza"      |
|            | 0.003*"waste time money"    |
|            | 0.003*"food good not"       |
|            | 0.003*"take drink order"    |
|            | 0.003*"didn even eat"       |
+------------+-----------------------------+
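
The keywords in the table above are word triples, and those in the preceding table are word pairs. A minimal sketch of how such n-gram tokens could be built before the LDA step, assuming lowercase regex tokenization and a small stop-word list that keeps "not". The function name `ngram_tokens` and the stop-word set are illustrative, not the notebook's actual preprocessing (which also appears to lemmatize).

```python
import re

# Hypothetical stop-word list; "not" is deliberately kept because negations
# such as "would not recommend" appear among the topic keywords.
STOP_WORDS = {"i", "the", "a", "an", "to", "and", "it", "this", "is", "was", "my", "will"}


def ngram_tokens(text, n=3):
    """Lowercase, tokenize, drop stop words, then join each run of n words."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


tokens = ngram_tokens("I will never come back to this place")
print(tokens)
```

Each review then becomes a list of such tokens, which is the kind of input a topic model like LDA works on.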

Image classification¶

Dataset preview¶

        photo_id                business_id             caption                                   label
------  ----------------------  ----------------------  ----------------------------------------  ------
190211  ZLVGQMk0Z-OeFNWtlQFpGA  1Nvx5xo_cErlEqpubzocSg  Southwest Chorizo Burger                  food
159695  e-DRAYViNXQoFMtIRbR7ag  rElxptPIJZicDM39e1ORTg                                            food
177849  hwlTSLEySvQnlvanyxXOyQ  EJ1r6E92bw7khcMSPH80rA                                            inside
107714  oexZE1WbqWnOO8bEnFVIaQ  WnVNjr9zVEpK85T7dbAfEg  Still Life of Roll. Oh, and Wasabi ball.  food
197858  Vua3uNjizuCSsArRn1h6qg  p4zm3a5-Ei8wjUV_KZq23w                                            inside
(200100, 4)
label
food       108152
inside      56031
outside     18569
drink       15670
menu         1678
Name: count, dtype: int64

Creating the sample dataset¶

The sample contains 500 images
     photo_id                label    width  height  mode  label_num
---  ----------------------  -------  -----  ------  ----  ---------
117  yZX66Ykboo4jdWOSCW70vA  outside  300.0  400.0   RGB   4
461  dTw0ZNmAetdtAeFzQqtI0w  menu     265.0  400.0   RGB   3
325  4ASMLlOMvMPohnjP4yFYMw  food     533.0  400.0   RGB   1
436  AHZeI38pU1QhUGYDxQ69jQ  menu     533.0  400.0   RGB   3
426  jEa3Y6D_YHrXu8ZDzHFOVw  menu     300.0  400.0   RGB   3
       width       height      label_num
-----  ----------  ----------  ---------
count  500.000000  500.000000  500.00000
mean   438.882000  389.688000  2.00000
std    131.985303  32.814085   1.41563
min    131.000000  69.000000   0.00000
25%    300.000000  400.000000  1.00000
50%    408.000000  400.000000  2.00000
75%    543.750000  400.000000  3.00000
max    600.000000  400.000000  4.00000
mode
RGB    500
Name: count, dtype: int64
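
The `label_num` statistics above (mean 2.0, standard deviation 1.41563) are what 100 images in each of the five classes would give, so the 500-image sample was presumably drawn in a balanced way even though the full dataset is heavily skewed towards `food`. A minimal sketch of such a draw, assuming rows of `(photo_id, label)`; the function name `balanced_sample` and the seed are hypothetical, and the notebook's own sampling code is not shown.

```python
import random
from collections import Counter, defaultdict


def balanced_sample(rows, per_label, seed=0):
    """Draw the same number of rows for every label (rows are (photo_id, label))."""
    by_label = defaultdict(list)
    for photo_id, label in rows:
        by_label[label].append((photo_id, label))
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sample = []
    for label in sorted(by_label):
        sample.extend(rng.sample(by_label[label], per_label))
    return sample


# Toy stand-in for the photo table: unbalanced, like the full dataset.
rows = [(f"food_{i}", "food") for i in range(50)] + [(f"menu_{i}", "menu") for i in range(8)]
sample = balanced_sample(rows, per_label=5)
print(Counter(label for _, label in sample))
```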

Clustering with SIFT descriptors¶

Image preprocessing¶

Preprocessing example¶

Image and its histogram before processing

Image and its histogram after processing

Creating the descriptors¶

Descriptor example

Descriptors: (501, 128)

[[ 0.  0.  0. ...  1. 18. 18.]
 [13. 14.  8. ... 24.  0.  1.]
 [24.  0.  0. ...  2.  5. 25.]
 ...
 [ 0.  0.  0. ...  0.  2.  8.]
 [ 0.  0.  0. ...  0.  2. 13.]
 [ 0.  0. 10. ...  0.  0. 12.]]
     photo_id                label  width  height  mode  label_num  desc
---  ----------------------  -----  -----  ------  ----  ---------  -------------------------------------------------
360  J4fwj6iamJ7mOCzYIQCujw  food   533.0  400.0   RGB   1          [[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0,...
214  SqInyQ4-CgIRXhXPJzjr0w  drink  600.0  400.0   RGB   0          [[42.0, 74.0, 7.0, 12.0, 22.0, 21.0, 41.0, 40....
385  AXPr8IBkgPg4Y19uzlFakw  food   504.0  337.0   RGB   1          [[17.0, 2.0, 2.0, 2.0, 118.0, 58.0, 5.0, 18.0,...

Clustering the descriptors¶

Principle:

  • All descriptors are grouped into clusters
  • The clusters are then used to classify the images by their degree of membership in each cluster
There are 494 clusters for a total of 244744 descriptors
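
The value 494 is the integer part of the square root of 244744, which suggests the number of clusters was set with a square-root heuristic; this is an inference from the numbers, not something the notebook states. The sketch below shows that heuristic together with plain Lloyd iterations of k-means on toy 2-D points. Real SIFT descriptors have 128 dimensions, and a library implementation would normally be used at this scale; the function names are illustrative.

```python
import math


def n_clusters_sqrt(n_descriptors):
    """Square-root heuristic for the number of clusters (an assumption)."""
    return math.isqrt(n_descriptors)


def kmeans(points, centroids, n_iter=10):
    """Plain Lloyd iterations: assign to nearest centroid, then recompute means."""
    for _ in range(n_iter):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        # An empty group keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


print(n_clusters_sqrt(244744))  # 494
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
```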

Creating the image features¶

Principles:

  • Each descriptor of the image is assigned to one of the descriptor clusters
  • For each cluster, we count how many of the image's descriptors belong to it
  • The counts can be displayed as a histogram, and that histogram is used as the image's features
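
The steps above amount to a bag-of-visual-words histogram. A minimal sketch on toy 2-D data, assuming the cluster centroids are already known; `bovw_histogram` is a hypothetical name, and the real descriptors are 128-dimensional with 494 clusters.

```python
import math


def bovw_histogram(descriptors, centroids):
    """Count, for each cluster, how many of the image's descriptors fall in it."""
    hist = [0] * len(centroids)
    for d in descriptors:
        nearest = min(range(len(centroids)), key=lambda c: math.dist(d, centroids[c]))
        hist[nearest] += 1
    return hist


centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
image_descriptors = [(0.5, 0.2), (4.0, 6.0), (5.5, 4.5), (0.1, 0.0)]
features = bovw_histogram(image_descriptors, centroids)
print(features)
```

The resulting list has one entry per cluster regardless of how many descriptors the image has, which is what makes images with different descriptor counts comparable.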
     photo_id                label  width  height  mode  label_num  desc                                               features
---  ----------------------  -----  -----  ------  ----  ---------  -------------------------------------------------  --------------------------------------------------
294  V7yLVqDLuVC_D0xbjQcXXw  drink  597.0  400.0   RGB   0          [[8.0, 1.0, 1.0, 1.0, 16.0, 33.0, 11.0, 7.0, 1...  [4, 2, 0, 2, 3, 1, 0, 0, 0, 7, 2, 0, 0, 0, 1, ...
400  huZEPkPbcgdtSKI7gclDhw  menu   300.0  400.0   RGB   3          [[0.0, 1.0, 112.0, 37.0, 1.0, 0.0, 0.0, 0.0, 1...  [8, 4, 1, 0, 0, 0, 1, 1, 0, 5, 4, 1, 1, 0, 2, ...
219  MSzntFiQ1aQUohWHNrkt1A  drink  339.0  400.0   RGB   0          [[26.0, 22.0, 8.0, 1.0, 1.0, 8.0, 41.0, 68.0, ...  [0, 0, 0, 0, 2, 4, 6, 2, 0, 1, 1, 0, 0, 0, 2, ...

Dimensionality reduction, then clustering¶

PCA reduction
Keeping 99.0% of the variance, PCA reduces the features from 494 components to 343 components
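
Choosing the number of components from a variance target comes down to a cumulative sum over the explained-variance ratios. A minimal sketch with toy ratios; `n_components_for` is a hypothetical helper. In scikit-learn, passing a float between 0 and 1 as `n_components` to `PCA` performs this selection, though whether the notebook does it that way is not shown.

```python
from itertools import accumulate


def n_components_for(explained_variance_ratio, threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    for k, cumulative in enumerate(accumulate(explained_variance_ratio), start=1):
        if cumulative >= threshold:
            return k
    return len(explained_variance_ratio)


# Toy ratios sorted in decreasing order, as PCA returns them.
ratios = [0.5, 0.25, 0.125, 0.0625, 0.0625]
print(n_components_for(ratios, 0.99))
print(n_components_for(ratios, 0.50))
```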
t-SNE reduction to 2 dimensions
Clustering
Cluster display

Adjusted Rand score = 0.068
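
The adjusted Rand score compares the clusters with the true labels: 1 means identical partitions, and values near 0 mean agreement no better than chance, so 0.068 indicates that the SIFT-based clusters barely match the photo labels. A minimal pure-Python sketch of the computation from pair counts; `adjusted_rand_index` is an illustrative name, and scikit-learn offers the same metric as `sklearn.metrics.adjusted_rand_score`.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index computed from the pair counts of the contingency table."""
    n = len(labels_true)
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = pairs_true * pairs_pred / comb(n, 2)
    maximum = (pairs_true + pairs_pred) / 2
    if maximum == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (pairs_both - expected) / (maximum - expected)


# Cluster ids are arbitrary: a relabelled but identical partition still scores 1.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```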

Clustering with a CNN¶

Image preprocessing¶

Preprocessing example¶

Image before processing

Original size => Height: 400, Width: 533

Image after processing

Adjusted size => Height: 400, Width: 600

Creating features from the VGG16 CNN¶

Principle:

  • VGG16 is used without its top part (the dense layers)
  • A 1x512 vector is extracted from the CNN for each image by running a prediction
  • This vector holds the image's features: as with SIFT, it is reduced to 2 dimensions, then the images are clustered
Model: "vgg16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_1 (InputLayer)        [(None, 400, 600, 3)]     0         
                                                                 
 block1_conv1 (Conv2D)       (None, 400, 600, 64)      1792      
                                                                 
 block1_conv2 (Conv2D)       (None, 400, 600, 64)      36928     
                                                                 
 block1_pool (MaxPooling2D)  (None, 200, 300, 64)      0         
                                                                 
 block2_conv1 (Conv2D)       (None, 200, 300, 128)     73856     
                                                                 
 block2_conv2 (Conv2D)       (None, 200, 300, 128)     147584    
                                                                 
 block2_pool (MaxPooling2D)  (None, 100, 150, 128)     0         
                                                                 
 block3_conv1 (Conv2D)       (None, 100, 150, 256)     295168    
                                                                 
 block3_conv2 (Conv2D)       (None, 100, 150, 256)     590080    
                                                                 
 block3_conv3 (Conv2D)       (None, 100, 150, 256)     590080    
                                                                 
 block3_pool (MaxPooling2D)  (None, 50, 75, 256)       0         
                                                                 
 block4_conv1 (Conv2D)       (None, 50, 75, 512)       1180160   
                                                                 
 block4_conv2 (Conv2D)       (None, 50, 75, 512)       2359808   
                                                                 
 block4_conv3 (Conv2D)       (None, 50, 75, 512)       2359808   
                                                                 
 block4_pool (MaxPooling2D)  (None, 25, 37, 512)       0         
                                                                 
 block5_conv1 (Conv2D)       (None, 25, 37, 512)       2359808   
                                                                 
 block5_conv2 (Conv2D)       (None, 25, 37, 512)       2359808   
                                                                 
 block5_conv3 (Conv2D)       (None, 25, 37, 512)       2359808   
                                                                 
 block5_pool (MaxPooling2D)  (None, 12, 18, 512)       0         
                                                                 
 global_max_pooling2d (Globa  (None, 512)              0         
 lMaxPooling2D)                                                  
                                                                 
=================================================================
Total params: 14,714,688
Trainable params: 14,714,688
Non-trainable params: 0
_________________________________________________________________
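
The shapes in the summary above can be traced by hand: each of the five max-pooling layers halves the height and width (rounding down), and the final global max pooling keeps one value per channel, which gives the 512-value vector. A minimal pure-Python sketch on a toy feature map; the helper names are illustrative. The last layer is consistent with Keras's `pooling="max"` option for `VGG16(include_top=False, ...)`, although the notebook's actual call is not shown.

```python
def pooled_size(size, n_pools=5):
    """Each 2x2 max-pooling layer with stride 2 halves a dimension, rounding down."""
    for _ in range(n_pools):
        size //= 2
    return size


def global_max_pool(feature_map):
    """Reduce a (height, width, channels) nested list to one max per channel."""
    pixels = [pixel for row in feature_map for pixel in row]
    return [max(channel) for channel in zip(*pixels)]


print(pooled_size(400), pooled_size(600))  # 12 18

# Toy 2 x 2 feature map with 3 channels instead of 12 x 18 with 512.
toy_map = [
    [[1.0, 0.0, 3.0], [2.0, 5.0, 0.0]],
    [[0.0, 4.0, 1.0], [7.0, 1.0, 2.0]],
]
print(global_max_pool(toy_map))  # [7.0, 5.0, 3.0]
```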
     photo_id                label   width  height  mode  label_num  features
---  ----------------------  ------  -----  ------  ----  ---------  --------------------------------------------------
19   k6wBjugZfGi1MAIZyYTSBw  inside  533.0  400.0   RGB   2          [37.16306, 51.73715, 53.12264, 19.078745, 33.1...
64   LmC1GJBNbZypLlXGwSFSuQ  inside  600.0  400.0   RGB   2          [31.974394, 14.303509, 62.38606, 30.424088, 64...
446  EnRmcgfLoERZshTb2NJBSg  menu    575.0  400.0   RGB   3          [55.595566, 30.554121, 26.499107, 0.0, 28.5950...
78   ZYpOugZB43r7TF2HwqqauQ  inside  533.0  400.0   RGB   2          [21.98121, 33.64898, 107.18874, 20.312721, 62....
407  wPEpDpRggdc1FCqgNHlN3w  menu    300.0  400.0   RGB   3          [32.333332, 31.162899, 16.601622, 0.0, 113.254...

Test 1: PCA -> t-SNE -> KMeans¶

PCA reduction
Keeping 99.0% of the variance, PCA reduces the features from 512 components to 339 components
t-SNE reduction to 2 dimensions
Clustering
Cluster display

Adjusted Rand score = 0.578

Test 2: KMeans -> PCA -> t-SNE¶

Clustering
PCA reduction
Keeping 99.0% of the variance, PCA reduces the features from 512 components to 339 components
t-SNE reduction to 2 dimensions
Cluster display

Adjusted Rand score = 0.318

Test 3: PCA (50% variance) -> KMeans -> t-SNE¶

PCA reduction
Keeping 50.0% of the variance, PCA reduces the features from 512 components to 21 components
Clustering
t-SNE reduction to 2 dimensions
Cluster display

Adjusted Rand score = 0.269

Retrieving data from the Yelp API¶

Principles:

  • Send a first request to the "search" endpoint to extract 200 restaurant ids (a loop with an offset is needed because a single request returns at most 50)
  • Send a second request, looping over the restaurant ids, to the "reviews" endpoint (the free tier returns at most 3 reviews)
  • Put the data in DataFrames, then save these DataFrames as parquet files
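
The offset loop from the first step can be sketched as follows, using only the standard library. The search URL is the Yelp Fusion business search endpoint; the function names, the `categories` filter and the API key variable are assumptions made for illustration, and the notebook's own request code is not shown. The page-fetching function is passed in as a parameter so the loop can be exercised without network access.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.yelp.com/v3/businesses/search"
PAGE_SIZE = 50  # maximum number of results per search request


def page_offsets(total, page_size=PAGE_SIZE):
    """Offsets needed to cover `total` results, one request per offset."""
    return list(range(0, total, page_size))


def fetch_search_page(api_key, location, offset, limit=PAGE_SIZE):
    """One call to the search endpoint; returns the list of businesses."""
    query = urllib.parse.urlencode(
        # The "restaurants" category filter is an assumption.
        {"location": location, "categories": "restaurants", "limit": limit, "offset": offset}
    )
    request = urllib.request.Request(
        f"{SEARCH_URL}?{query}", headers={"Authorization": f"Bearer {api_key}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["businesses"]


def collect_business_ids(fetch_page, total=200):
    """Loop over the offsets and gather business ids; fetch_page(offset) -> businesses."""
    ids = []
    for offset in page_offsets(total):
        ids.extend(business["id"] for business in fetch_page(offset))
    return ids[:total]


print(page_offsets(200))  # [0, 50, 100, 150]
```

With a valid key stored in a hypothetical `API_KEY` variable, the ids would be collected with `collect_business_ids(lambda offset: fetch_search_page(API_KEY, "Paris", offset))`. The second step would then loop over these ids and request `https://api.yelp.com/v3/businesses/{id}/reviews` for each one.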
Reading from the files of the last run (because of the daily limit of the API's free tier)
Sample of the reviews from the Yelp API
     text                                               rating
---  -------------------------------------------------  ------
427  Absolutely blown away by everything: from the ...  5
130  Great place for food and cocktails, highly rec...  5
598  Good service, food, music and ambiance. I ate ...  5
201  Food was excellent. Taste was just right and n...  5
55   Le Comptoir was recommended to me by a friend....  4
There are 600 records for 200 restaurants in the city of Paris
(NB: the Yelp API's free tier provides only 3 reviews per business id)